Abstract

We study clustering on graphs with multiple edge types. Our main motivation is that similarities
between objects can be measured by many different metrics. For instance, similarity between two papers
can be based on common authors, where they are published, keyword similarity, citations, etc. As
such, graphs with multiple edge types are a more accurate model for describing similarities between objects.
Each edge type/metric provides only partial information about the data; recovering full information requires
aggregating all the similarity metrics. Clustering becomes much more challenging in this context,
since in addition to the difficulties of the traditional clustering problem, we have to deal with a space
of clusterings. We generalize the concept of clustering in single-edge graphs to multi-edged graphs
and investigate problems such as: Can we find a clustering that remains good, even if we change the
relative weights of metrics? How can we describe the space of clusterings efficiently? Can we find
unexpected clusterings (a good clustering that is distant from all given clusterings)? If given the
ground-truth clustering, can we recover how the weights for edge types were aggregated?

1 Introduction 

A community or a cluster in a graph is a subset of vertices that are tightly coupled among themselves and 
loosely coupled with the rest of the graph. Finding these communities is one of the fundamental problems of
graph analysis and has been the subject of numerous research efforts. Most of these efforts begin with the
premise that a simple graph is already constructed. A relation between two objects is represented by an edge 
between two nodes which may be weighted by the strength of the connection or left as a binary variable. This 
paper studies the community detection problem on graphs with multiple edge types or multiple similarity 
metrics, as opposed to traditional graphs with a single edge type. 

In many real-world problems, similarities between objects can be defined by many different relationships.
For instance, similarity between two scientific articles can be defined based on authors, citations to,
citations from, keywords, titles, where they are published, text similarity, and many more. Relationships
between people can be based on the nature of the relationship (e.g., business, family, friendships), the means
of communication (e.g., emails, phone, in person), etc. Electronic files can be grouped by their type (LaTeX,
C, HTML), names, the time they are created, or the pattern in which they are accessed. In these examples, there are
multiple graphs that define relationships between the subjects. We may choose to reduce this multivariate
information to construct a single composite graph. This is convenient as it enables application of many
strong results from the literature. However, the information lost during this aggregation may be crucial,
and we believe working on graphs with multiple edge types is a more precise representation of the problem,
and thus will lead to more accurate analyses. Despite its importance, the literature on clustering graphs with
multiple edge types is sparse. Mucha et al. [13] looked at community detection when multiple edge types
are sampled in time and strongly correlated. Dunlavy et al. [5] described this problem as a three-dimensional
tensor and used a PARAFAC decomposition (a generalization of the SVD) to identify dominant factors.

Footnote: [...] National Laboratories, a multiprogram laboratory operated by Sandia Corporation, a wholly owned
subsidiary of Lockheed Martin Corporation, for the United States Department of Energy's National Nuclear
Security Administration under contract DE-AC04-94AL85000.

The community detection problem on graphs with multiple edge types raises many interesting questions.
If a ground-truth clustering is known, can we recover an aggregation scheme that best resonates with the
ground-truth data? How can we efficiently represent the space of clusterings spanned by the multiple edge
types? Are the clusterings within this space clustered themselves? How do we find significantly different
clusterings for the same data? These problems add another level of complexity to the already difficult
problem of community detection in graphs. As in the single edge type case, the challenges lie not only in
algorithms, but also in formulations of these problems. In this paper, we investigate these problems and
propose some solutions. Our techniques rely on optimization (specifically, methods that rely only on
function values), methods for classical community detection (i.e., community detection with a single edge
type), and metrics for quantifying the distance between two clusterings. We present results of three case
studies on: 1) file system data, where files are grouped into projects; 2) papers submitted to arXiv; and 3)
countries, where the data is based on the CIA World Factbook [2].

The rest of the paper is organized as follows. In the next section, we review some of the background
information for this paper. In particular, we present the variation of information metric [11] that we use
to quantify the distance between two clusterings. In Section 3 we study the problem of computing an
aggregate similarity measure for a given ground-truth clustering: given a graph with multiple edge types
and the ground-truth clustering, can we find an aggregation scheme to reduce the information from multiple
edge types into a single metric, such that this metric best resonates with the ground-truth data? We apply
our methods to synthetic data and to the file system data. Section 4 addresses the latent clustering structure of
graphs with multiple edge types and introduces the concept of meta-clustering. We also discuss how we
can represent the meta-clustering structure efficiently, and provide results on arXiv data. In Section 5 we
discuss how we can seek unexpected clusters and how we can improve the significance of clusters by using
multiple edge types. We conclude with Section 6.



2 Background 

A weighted graph is represented as a tuple $G = (V, E)$, with $V$ a set of vertices and $E$ a set of edges. Each
edge $e_i$ is a tuple $e_i = \{v_a, v_b, w_i \mid v_a, v_b \in V, w_i \in \mathbb{R}\}$ representing a connection between
vertices $v_a$ and $v_b$ with weight $w_i$. In this work we replace $w_i \in \mathbb{R}$ with $w_i \in \mathbb{R}^k$, with $k$ being the
number of edge types. We will refer to such graphs as graphs with multiple edge types or multiweighted graphs.
We will construct functions that map multiweighted edges $w_i \in \mathbb{R}^k$ to composite edge types
$f(w_i) = \omega_i \in \mathbb{R}$. For much of this paper $f$ will be linear, $\omega_i = \sum_{j=1}^{k} \alpha_j w_i^j$.
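
As a minimal sketch of the linear aggregation described above, the following Python fragment collapses each k-dimensional edge weight into one composite weight. The function name `aggregate`, the dictionary layout, and the toy graph are illustrative assumptions, not part of our implementation.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]

def aggregate(multi_edges: Dict[Edge, List[float]], alpha: List[float]) -> Dict[Edge, float]:
    """Map each multiweighted edge to a composite weight sum_j alpha[j] * w[j]."""
    composite = {}
    for edge, weights in multi_edges.items():
        if len(weights) != len(alpha):
            raise ValueError("each edge needs exactly one weight per edge type")
        composite[edge] = sum(a * w for a, w in zip(alpha, weights))
    return composite

# Hypothetical toy graph with k = 2 edge types (for example, shared authors and citations).
multi_edges = {("a", "b"): [1.0, 0.0], ("b", "c"): [0.5, 2.0], ("a", "c"): [0.0, 1.0]}
composite = aggregate(multi_edges, alpha=[0.75, 0.25])
```

Any single-edge-type clustering tool can then be applied to `composite`; the choice of `alpha` is what the remainder of the paper is concerned with.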



2.1 Clustering 

Intuitively, the goal of clustering is to break down the graph into smaller groups such that vertices in each 
group are tightly coupled among themselves and loosely coupled with the remainder of the graph. Both the 
translation of this intuition into a well-defined mathematical formula and design of associated algorithms 
pose significant challenges. Despite the high quality and the high volume of the literature, the area continues 
to draw a lot of interest due to the growing importance of the problem and the challenges posed by the size 
and mathematical variety of the subject graphs. 

Our goal is to extend the concept of clustering to graphs with multiple edge types without getting into 
the details of clustering algorithms and formulations, since such a detailed study will be well beyond the 
scope of this paper. In this paper, we used Graclus, software developed by Dhillon et al. [4], which uses a
top-down approach that recursively splits the graph into smaller pieces, and FastCommunity, developed by
Clauset et al. [3], which uses an agglomerative approach that optimizes the modularity metric. For further
information on clustering see Lancichinetti et al. [6].



2.2 Variation of information of clusterings 

At the core of most of our discussions will be similarity between two clusterings, which calls for a method
to quantify the distance between two clusterings. Several metrics and methods have been proposed for
comparing clusterings, such as variation of information [11], scaled coverage measure [17], classification
error [8, 10, 11], and Mirkin's metric [12]. Out of these, we have used the variation of information metric in
our experiments.

Let $C_0 = (C_0^1, C_0^2, \ldots, C_0^K)$ and $C_1 = (C_1^1, C_1^2, \ldots, C_1^{K'})$ be two clusterings of the same node set.
Let $n$ be the total number of nodes, and $P(C, k) = |C^k| / n$ be the probability that a node is in cluster $C^k$
in a clustering $C$. Similarly, the probability that a node is in cluster $C_i^k$ in clustering $C_i$ and in cluster
$C_j^l$ in clustering $C_j$ is $P(C_i, C_j, k, l) = |C_i^k \cap C_j^l| / n$. The entropy of information, or expectation
value of learned information, in $C_i$ is defined as

$$H(C_i) = -\sum_{k=1}^{K} P(C_i, k) \log P(C_i, k),$$

and the mutual information shared by $C_i$ and $C_j$ is

$$I(C_i, C_j) = \sum_{k=1}^{K} \sum_{l=1}^{K'} P(C_i, C_j, k, l) \log \frac{P(C_i, C_j, k, l)}{P(C_i, k)\, P(C_j, l)}.$$

Given these two quantities, Meila defines the variation of information metric by

$$d_{VI}(C_i, C_j) = H(C_i) + H(C_j) - 2 I(C_i, C_j). \qquad (1)$$
Meila [11] explains the intuition behind this metric as follows. $H(C_i)$ denotes the average uncertainty of the
position of a node in clustering $C_i$. If, however, we are given $C_j$, $I(C_i, C_j)$ denotes the average reduction in
uncertainty of where a node is located in $C_i$. If we rewrite Equation (1) as

$$d_{VI}(C_i, C_j) = (H(C_i) - I(C_i, C_j)) + (H(C_j) - I(C_i, C_j)),$$

the first term measures the information lost if $C_j$ is the true clustering and we know instead $C_i$, and the
second term is the opposite. Note that $d_{VI}(C_i, C_j)$ will be zero when the two clusterings are the same, and
it will be maximum when the two clusterings are independent.

The variation of information metric can be computed in $O(n)$ time.
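
The definitions above translate directly into a short routine. The sketch below assumes each clustering is given as a list of cluster labels, one per node, and uses the natural logarithm; the function name and input layout are illustrative assumptions. It makes a single pass over the nodes to build the counts, in line with the linear-time remark.

```python
from collections import Counter
from math import log
from typing import Hashable, Sequence

def variation_of_information(labels_i: Sequence[Hashable], labels_j: Sequence[Hashable]) -> float:
    """d_VI = H(C_i) + H(C_j) - 2 I(C_i, C_j) for two labelings of the same nodes."""
    if len(labels_i) != len(labels_j):
        raise ValueError("both clusterings must label the same node set")
    n = len(labels_i)
    size_i = Counter(labels_i)                 # |C_i^k|
    size_j = Counter(labels_j)                 # |C_j^l|
    joint = Counter(zip(labels_i, labels_j))   # |C_i^k intersect C_j^l|

    def entropy(sizes: Counter) -> float:
        return -sum((c / n) * log(c / n) for c in sizes.values())

    mutual = sum(
        (c / n) * log((c / n) / ((size_i[k] / n) * (size_j[l] / n)))
        for (k, l), c in joint.items()
    )
    return entropy(size_i) + entropy(size_j) - 2.0 * mutual
```

Only nonempty intersections appear in `joint`, so no term of the form 0 log 0 is ever evaluated.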

3 Recovering a graph given a ground truth clustering 

Suppose we are given a ground-truth clustering for a graph with multiple edge types/similarity metrics. 
Can we recover an aggregation scheme that best resonates with the ground-truth data? This aggregation 
scheme that reduces multiple similarity measurements into a single similarity measurement can be a crucial 
enabler that reduces the problem of finding communities with multiple similarity metrics, to a well-known, 
fundamental problem in data analysis. Additionally if we can obtain this aggregation scheme from data sets 
for which the ground-truth is available, we may then apply the same aggregation to other data instances in 
the same domain. 

Formally, we work on the following problem. We are given a graph $G = (V, E)$ with multiple similarity
measurements for each edge, $(w_i^1, w_i^2, \ldots, w_i^k) \in \mathbb{R}^k$, and a ground-truth clustering $C^*$ for this graph.
Our goal is to find a weighting vector $\alpha \in \mathbb{R}^k$ such that $C^*$ is an optimal clustering for the graph $G$
whose edges are weighted as $\omega_i = \sum_{j=1}^{k} \alpha_j w_i^j$. Note that this is only a semi-formal definition, as we have
not formally defined what we mean by an optimal clustering. In addition to the well-known difficulty of 
defining what a good clustering means, matching to the ground-truth data has specific challenges, which we 
discuss in the subsequent section. 

Below, we describe two approaches. The first approach is based on inverse problems, and we try to find
weighting parameters for which the clustering on the graph yields the ground-truth clustering. The second
approach computes weighting parameters that maximize the quality of the ground-truth clustering.

3.1 Solving an inverse problem 

Inverse problems arise in many scientific computing applications where the goal is to infer unobservable 
parameters from finite observations. Solutions typically involve iterations of predictions and solution of the 
forward problems to compute the accuracy of the prediction. Our problem can be considered as an inverse 
problem, since we are trying to compute an aggregation function, from a given clustering. The forward 
problem in this case is the clustering operation. We can start with a random guess for the edge weights, 
cluster the new graph, and use the distance between two clusterings as a measure for the quality of the 
guess. We can further put this process within an optimization loop to find the parameters that yield the 
closest clustering to the ground-truth. 
The disadvantage of this method is that it relies on the accuracy of the forward solution, i.e., the clustering
algorithm. If we are given the true solution to the problem, can we construct the same clustering?
This will not be easy for two reasons. First, there is no perfect clustering algorithm, and second, even if
we were able to solve the clustering problem optimally, we would not have the exact objective function for
clustering. Also, solving many clustering problems will be time-consuming, especially for large graphs.
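
The guess, cluster, and compare loop of the inverse approach can be sketched as follows. This is only an illustration under strong assumptions: the forward solver is a stand-in (connected components after thresholding composite weights) rather than a real clustering tool such as Graclus, the candidate weights come from a fixed grid rather than an optimizer, and the toy graph is hypothetical.

```python
from collections import Counter
from math import log

def vi_distance(a, b):
    """Variation of information between two label lists (natural log)."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    entropy = lambda cnt: -sum(c / n * log(c / n) for c in cnt.values())
    mutual = sum(c / n * log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items())
    return entropy(pa) + entropy(pb) - 2 * mutual

def forward_cluster(n, multi_edges, alpha, threshold=0.5):
    """Stand-in forward solver (an assumption): connected components of the
    graph that keeps only edges whose composite weight exceeds the threshold."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (u, v), w in multi_edges.items():
        if sum(a * x for a, x in zip(alpha, w)) > threshold:
            parent[find(u)] = find(v)
    return [find(x) for x in range(n)]

def inverse_search(n, multi_edges, truth, candidates):
    """Cluster under each candidate weighting; keep the one closest to the truth."""
    scored = [(vi_distance(forward_cluster(n, multi_edges, a), truth), a) for a in candidates]
    return min(scored, key=lambda pair: pair[0])

# Hypothetical toy data: edge type 0 links nodes inside clusters, type 1 links across them.
edges = {(0, 1): [1.0, 0.0], (1, 2): [1.0, 0.0], (3, 4): [1.0, 0.0],
         (4, 5): [1.0, 0.0], (2, 3): [0.0, 1.0]}
truth = [0, 0, 0, 1, 1, 1]
grid = [[i / 10, 1 - i / 10] for i in range(11)]
best_vi, best_alpha = inverse_search(6, edges, truth, grid)
```

Every candidate costs one full clustering run, which is the scalability concern raised above.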

3.2 Maximizing the quality of ground-truth clustering 

An alternative approach is to find an aggregation function that maximizes the quality of the ground-truth 
clustering. For this purpose, we have to take into account not only the overall quality of the clustering, but 
also the placement of individual vertices, as the individual vertices represent local optimality. For instance, 
if the quality of the clustering will improve by moving a vertex to another cluster than its ground-truth, then 
the current solution cannot be ideal. While it is fair to assume some vertices might have been misclassified 
in the ground-truth data, there should be a penalty for such vertices. Thus we have two objectives while 
computing a: (i) justifying the location of each vertex (ii) maximizing the overall quality of the clustering. 

3.2.1 Justifying locations of individual vertices 

For each vertex $v$ we define the pull to each cluster $C^k$ in $C = (C^1, C^2, \ldots, C^K)$ to be the cumulative
weight of the edges between $v$ and its neighbors in $C^k$,

$$\mathrm{pull}_\alpha(v, C^k) = \sum_{w_i = (u, v) \in E;\ u \in C^k} \ \sum_{j} \alpha_j w_i^j. \qquad (2)$$

We further define the holding power, $H_\alpha(v)$, for each vertex to be the pull of the cluster to which the vertex
belongs in $C^*$ minus the next largest pull among the remaining clusters. If this number is positive then $v$
is held more strongly to the proper cluster than to any other. We can then maximize the number of vertices
with positive holding power by maximizing $|\{v : H_\alpha(v) > 0\}|$. What is important here is the concept of
pull and hold, as the specific definitions may be changed without altering the core idea.



Figure 1: Arctangent provides a smooth blend between step and linear functions. (Plot title: Holding Power
Response Functions.)



While this method is local and easy to compute, its discrete nature limits the tools that can be used to solve
the associated optimization problem. Because gradient information is not available, our ability to
navigate the search space is hindered. In our experiments, we smoothed the step-like nature of the function
$H_\alpha(v) > 0$ by replacing it with $\arctan(H_\alpha(v))$. This functional form still encodes that we want holding
power to be positive for each node, but it allows the optimization routine to benefit from small improvements.
It emphasizes nodes which are close to the $H_\alpha(v) = 0$ crossing point (large gradients) over nodes which are
well entrenched (low gradients near extremes).
This objective function sacrifices holding scores for nodes which are safely entrenched in their cluster
(high holding power) or are lost causes (very low holding power) for those which are near the cross-over
point. The extent to which it does this can be tuned by the steepness parameter of the arctangent. For very
steep parameters this function resembles classification (step function), while for very shallow parameters it
resembles a simple linear sum, as seen in Fig. 1. We can solve the following optimization problem to maximize
the number of vertices whose positions in the ground-truth clustering are justified by the weighting
vector $\alpha$.

$$\arg\max_{\alpha} \sum_{v \in V} \arctan(H_\alpha(v)) \qquad (3)$$
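
The pull, the holding power, and the smoothed objective of Equation (3) can be sketched as below. The function names, the `steepness` multiplier inside the arctangent, the convention that a vertex with no edges to other clusters has a rival pull of zero, and the toy graph are all assumptions made for illustration.

```python
from math import atan

def composite(alpha, w):
    """Linear aggregation of one multiweighted edge."""
    return sum(a * x for a, x in zip(alpha, w))

def pulls(v, multi_edges, alpha, cluster_of):
    """Cumulative composite weight between v and its neighbors, per cluster."""
    pull = {}
    for (a, b), w in multi_edges.items():
        if v == a:
            other = b
        elif v == b:
            other = a
        else:
            continue
        c = cluster_of[other]
        pull[c] = pull.get(c, 0.0) + composite(alpha, w)
    return pull

def holding_power(v, multi_edges, alpha, cluster_of):
    """Pull of v's ground-truth cluster minus the largest pull of any other cluster."""
    pull = pulls(v, multi_edges, alpha, cluster_of)
    home = cluster_of[v]
    rival = max((p for c, p in pull.items() if c != home), default=0.0)
    return pull.get(home, 0.0) - rival

def smoothed_objective(multi_edges, alpha, cluster_of, steepness=1.0):
    """Sum over vertices of arctan(steepness * holding power)."""
    return sum(atan(steepness * holding_power(v, multi_edges, alpha, cluster_of))
               for v in cluster_of)

# Hypothetical toy graph: edge type 0 agrees with the ground truth, type 1 contradicts it.
edges = {(0, 1): [1.0, 0.0], (2, 3): [1.0, 0.0], (1, 2): [0.0, 1.0]}
truth = {0: "A", 1: "A", 2: "B", 3: "B"}
good = smoothed_objective(edges, [1.0, 0.0], truth)
bad = smoothed_objective(edges, [0.0, 1.0], truth)
```

Here `good` exceeds `bad`, since weighting only the first edge type gives every vertex a positive holding power.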

3.2.2 Overall clustering quality 

In addition to individual vertices being justified, the overall quality of the clustering should be maximized as
well. Any quality metric can potentially be used for this purpose; however, we find that some strictly linear
functions have a trivial solution. Consider an objective function that measures the quality of a clustering
as the sum of the inter-cluster edges. To minimize the cumulative weights of cut edges, or equivalently to
maximize the cumulative weights of internal edges, we solve

$$\min_{\alpha} \sum_{e_j \in Cut} \sum_{k=1}^{K} \alpha_k w_j^k \quad \text{s.t.} \quad |\alpha| = 1,$$

where $Cut$ denotes the set of edges whose end points are in different clusters. Let $S^k$ denote the sum of the
cut edges with respect to the $k$-th metric. That is, $S^k = \sum_{e_j \in Cut} w_j^k$. Then the objective function can be
rewritten as $\min_{\alpha} \sum_{k} \alpha_k S^k$. Because this is linear, it has a trivial solution that assigns 1 to the weight
of the smallest $S^k$, which means only one similarity metric is taken into account. While additional constraints
may exclude this specific solution, a linear formulation of the quality will always yield only a trivial solution
within the new feasible region.
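
A small numerical check illustrates the trivial solution. The sketch assumes the constraint is read as nonnegative weights that sum to one; under that reading the minimum of the linear objective sits at the vertex that puts all weight on the metric with the smallest cut sum, and no random point of the feasible region does better. The graph and names are hypothetical.

```python
import random

def cut_sums(multi_edges, cluster_of, k):
    """S^k: total weight of the cut edges under each of the k metrics."""
    sums = [0.0] * k
    for (u, v), w in multi_edges.items():
        if cluster_of[u] != cluster_of[v]:
            for j in range(k):
                sums[j] += w[j]
    return sums

def linear_cut(alpha, sums):
    """Linear objective sum_k alpha_k * S^k."""
    return sum(a * s for a, s in zip(alpha, sums))

def vertex_solution(sums):
    """Put all weight on the metric with the smallest cut sum."""
    best = min(range(len(sums)), key=lambda j: sums[j])
    return [1.0 if j == best else 0.0 for j in range(len(sums))]

def random_simplex_point(k, rng):
    """A random nonnegative weight vector that sums to one."""
    raw = [rng.random() for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical graph with three metrics per edge and two clusters.
edges = {(0, 1): [1.0, 0.2, 0.5], (1, 2): [0.3, 0.9, 0.4],
         (2, 3): [0.8, 0.1, 0.6], (0, 3): [0.2, 0.7, 0.3]}
cluster_of = {0: "A", 1: "A", 2: "B", 3: "B"}
sums = cut_sums(edges, cluster_of, 3)
alpha_star = vertex_solution(sums)
rng = random.Random(0)
samples = [random_simplex_point(3, rng) for _ in range(1000)]
```

This is why a nonlinear quality measure such as modularity is needed if more than one metric is to receive weight.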

In our experiments we used the modularity metric [15]. The modularity metric uses a random graph
generated with respect to the degree distribution as the null hypothesis, setting the modularity score of a
random clustering to 0. Formally, the modularity score for an unweighted graph is

$$Q = \frac{1}{2m} \sum_{ij} \left[ e_{ij} - \frac{d_i d_j}{2m} \right] \delta_{ij}, \qquad (4)$$

where $e_{ij}$ is a binary variable that is 1 if and only if vertices $v_i$ and $v_j$ are connected; $d_i$ denotes the degree
of vertex $v_i$; $m$ is the number of edges; and $\delta_{ij}$ is a binary variable that is 1 if and only if vertices $v_i$ and $v_j$
are in the same cluster. In this formulation, $d_i d_j / 2m$ corresponds to the number of edges between vertices $v_i$
and $v_j$ in a random graph with the given degree distribution, and its subtraction corresponds to the null
hypothesis.

This formulation can be generalized to weighted graphs by redefining $e_{ij}$ as the weight of this edge (0
if no such edge exists), $d_i$ as the cumulative weight of edges incident to $v_i$, and $m$ as the cumulative weight
of all edges in the graph [14].
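
For concreteness, weighted modularity can be computed in one pass over the edges. The sketch below assumes an undirected graph without self-loops, stored as a dictionary from vertex pairs to weights, and uses the standard form with $d_i d_j / 2m$ as the null-model term; summing that form over ordered pairs gives the per-cluster expression in the final line. Names and data are illustrative.

```python
def modularity(edges, cluster_of):
    """Weighted modularity of a clustering of an undirected graph without self-loops.

    edges maps (u, v) to a weight; cluster_of maps each vertex to its cluster.
    """
    m = sum(edges.values())            # cumulative weight of all edges
    degree = {}                        # weighted degree d_i
    internal = 0.0                     # weight of edges inside clusters
    for (u, v), w in edges.items():
        degree[u] = degree.get(u, 0.0) + w
        degree[v] = degree.get(v, 0.0) + w
        if cluster_of[u] == cluster_of[v]:
            internal += w
    cluster_degree = {}                # sum of d_i over each cluster
    for node, d in degree.items():
        c = cluster_of[node]
        cluster_degree[c] = cluster_degree.get(c, 0.0) + d
    # (1/2m) * sum_ij e_ij delta_ij = internal / m
    # (1/2m) * sum_ij (d_i d_j / 2m) delta_ij = sum_c D_c^2 / (4 m^2)
    return internal / m - sum(d * d for d in cluster_degree.values()) / (4.0 * m * m)

# Two unit-weight triangles joined by a single edge (hypothetical example).
triangles = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0,
             (3, 4): 1.0, (4, 5): 1.0, (3, 5): 1.0, (2, 3): 1.0}
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q_split = modularity(triangles, split)
```

To score the ground-truth clustering under a weighting $\alpha$, the same routine is applied to the composite edge weights.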
3.2.3 Solving the optimization problems

We have presented several nonlinear optimization problems for which derivative information is not available.
To solve these problems we used HOPSPACK (Hybrid Optimization Parallel Search PACKage) [16],
which was developed at Sandia National Laboratories to solve linear and nonlinear optimization problems
when the derivatives are not available.
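
To illustrate what optimizing with function values only looks like, the following is a minimal compass (coordinate pattern) search. It is a generic textbook method chosen for brevity, not HOPSPACK and not a description of its interface; the step schedule, the stopping rule, and the stand-in objective are assumptions.

```python
def compass_search(f, x0, step=0.5, tol=1e-6, max_evals=10000):
    """Maximize f using only function values.

    Polls +step and -step along each coordinate, moves to the first improving
    point, and halves the step whenever no poll point improves.
    """
    x = list(x0)
    fx = f(x)
    evals = 1
    while step > tol and evals < max_evals:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = x[:]
                trial[i] += delta
                f_trial = f(trial)
                evals += 1
                if f_trial > fx:
                    x, fx, improved = trial, f_trial, True
                    break
            if improved:
                break
        if not improved:
            step *= 0.5
    return x, fx

# Stand-in objective (an assumption): a smooth function with its maximum at (0.3, 0.7).
def toy_objective(alpha):
    return -((alpha[0] - 0.3) ** 2) - (alpha[1] - 0.7) ** 2

best_alpha, best_value = compass_search(toy_objective, [0.0, 0.0])
```

In our setting `f` would be an objective such as Equation (3) evaluated at a weighting vector, so the number of unknowns equals the number of edge types.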



3.3 Experimental results 
3.3.1 Recovering edge weights 

The goal of this set of experiments is to see whether we can find aggregation functions that justify a given 
clustering. We have performed our experiments on 3 data sets. 

Synthetic data: Our generation method is based on Lancichinetti et al.'s work [7], which proposes a method to
generate graphs as benchmarks for clustering algorithms. We generated graphs of sizes 500, 1000, 2000, and
4000 nodes, with 30 edges per node on average, mixing parameters $\mu_t = .7$, $\mu_w = .75$, and known communities.
We then perturbed the edge weights, $w_i$, with additive and multiplicative noise so that $w_i \leftarrow \nu (w_i + \sigma)$,
with $\sigma \in (-2 w_a, 2 w_a)$ and $\nu \in (0, 1)$ drawn uniformly, independently, and identically distributed, where
$w_a$ is the average edge weight.



Figure 2: Three histograms of holding powers. Blue: an example perturbed (poor) edge type; green: the original
data (very good); red: the optimal blend of ten of the perturbed edge types. (Panel titles: Testing Optimization
on Synthetic Data, 500 Nodes; Testing Optimization on Synthetic Data, 2000 Nodes.)

After the noise, none of the original metrics preserved the original clustering structure. We display
this in Fig. 2, which presents histograms of the holding power of vertices. The green bars correspond to
vertices of the original graph; they all have positive holding power. The blue bars correspond to holding
powers after noise is added. We only present one edge type for clarity of presentation. As can be seen, a
significant portion (30%) of the vertices have negative holding power, which means they would rather be in
another cluster. The red bars show the holding powers after we compute an optimal linear aggregation. As
seen in the figure, almost all vertices move to the positive side, justifying the ground-truth clustering. A few
vertices with negative holding power are expected, even for an optimal solution, due to the noise. These
results show that a composite similarity that resonates with a given clustering can be computed out of many
metrics, none of which gives a good solution by itself.

In Table 1 we present these results on graphs with different numbers of vertices. While the percentages
change for different numbers of vertices, our main conclusion that a good clustering can be achieved via a
better aggregation function remains valid.



Table 1: Fraction of nodes with positive holding power for ground-truth, perturbed, and optimized graphs

Number of nodes   Number of clusters   Ground-truth   Optimized   Perturbed (average)
500               14                   .965           .922        .703
1000              27                   .999           .977        .799
2000              58                   .999           .996        .846
4000              118                  1.00           .997        .860



File system data: The owner of a workstation classified 300 of his files as belonging to one of his three
ongoing projects, which we took as the ground-truth clustering. We used filename similarity,
time-of-modification/time-of-creation similarity, ancestry (distance in the directory tree), and parenthood (edges
between a directory node and the file nodes in this directory) as the similarity metrics among these files.

Figure 3: Optimal solutions for the file system data for different arctangent parameters. (Plot title: FileSystem
Solutions.)

Our results showed that only three metrics (time-of-modification, ancestry, and parenthood) affected the
clustering. However, the solutions were sensitive to the choice of the arctangent parameter. In Fig. 3, each
column corresponds to an optimal solution for the corresponding arctangent parameter. Recall that higher
values of the arctangent parameter correspond to sharper step functions. Hence, the right of the figure
corresponds to maximizing the total number of vertices with positive holding power, while the left side
corresponds to maximizing the sum of holding powers. The difference between the two is that the solutions
on the right side may have a lot of nodes with barely positive values, while those on the left may have nodes
further away from zero at the cost of more nodes with negative holding power. This is expected in general,
but the drastic change in the optimal solutions as we go from one extreme to the other was surprising to us and
should be taken into account in further studies.

Arxiv data: We took 30,000 high-energy physics articles published on arXiv.org and considered abstract
text similarity, title similarity, citation links, and shared authors as edge types for these articles. We used the
top-level domain of the submitter's e-mail (.edu, .uk, .jp, etc.) as a proxy for the region where the work
was done. We used these regions as the ground-truth clustering.

The best parameters that explained the ground-truth clustering were 0.0 for abstracts, 1.0 for authors,
0.059 for citations, and 0.0016 for titles. This means the shared authors edge type is almost entirely favored,
with cross-citations coming a distant second. This is intuitive because a graph of articles linked by common
authors will be linked both by topic (we work with people in our field) and by geography (we often work
with people in nearby institutions), whereas edge types like abstract text similarity tend to encode only the
topic of a paper, which is less geographically correlated. Different groups can work on the same topic. It
was good to see that citations factored in, and the clear dominance of the author information was
noteworthy. As future work, we plan to investigate nonlinear aggregation functions on this graph.

3.3.2 Clustering quality vs. holding vertices 

We have stated two goals while computing an aggregation function: justifying the position of each vertex
and the overall quality of the clustering. In Fig. 4 we present the Pareto frontier for the two objectives. The
vertical axis represents the quality of the clustering with respect to the modularity metric [15], while the
horizontal axis represents the percentage of nodes with positive holding power. The modularity numbers
are normalized with respect to the modularity of the ground-truth clustering, and normalized numbers can
be above 1, since the ground-truth clustering does not specifically aim at maximizing modularity.
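
Given a set of candidate solutions scored on both objectives, the Pareto frontier is the subset that no other candidate matches or beats on both axes at once. A minimal sketch follows; the sample scores are made-up placeholders, not values from our experiments.

```python
def pareto_front(points):
    """Keep the points not dominated by a different point; both coordinates are maximized."""
    front = []
    for p in points:
        dominated = any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)

# Placeholder scores: (fraction of nodes with positive holding power, normalized modularity).
candidates = [(0.70, 1.02), (0.80, 1.01), (0.90, 0.99), (0.75, 1.00), (0.85, 0.98)]
frontier = pareto_front(candidates)
```

The quadratic scan is adequate for the handful of candidates an optimizer returns; a sort-based sweep would be preferable for large candidate sets.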

As expected, Fig. 4 shows a trade-off between the two objectives. However, the scale difference between
the two axes should be noted. The full range of change is limited to only 3% for modularity,
while it is more than 20% for the percentage of vertices with positive holding power. More importantly,
by only looking at the holding powers we can preserve the original modularity quality. The reason for
this is that we have relatively small clusters, and almost all vertices have a connection with a cluster besides
their own. If we had clusters where many vertices had all their connections within their clusters (e.g., much
larger clusters), then this would not have been the case, and having a separate quality-of-clustering metric
would have made sense. However, we know that most complex networks have small communities no matter
how big the graphs are [9]. Therefore, we expect that looking only at the holding powers of vertices will be
sufficient to recover aggregation functions.
Figure 4: Pareto frontier for two objectives: normalized modularity and percentage of nodes with positive holding
power

3.3.3 Inverse problems vs. maximizing clustering quality 

We used the file system data set to investigate the relationship between the two proposed approaches, and
we present our results in Fig. 5. For this figure, we computed the objective function for the ground-truth
clustering for various aggregation weights and used the same weights to compute clusterings with Graclus.
From these clusterings we computed the variation of information (VI) distance to the ground-truth. Fig. 5
presents the correlation between the two measures: VI distance for Graclus clusterings for the first approach,
and the objective function values for the second approach. This tries to answer whether solutions with
higher objective function values yield clusterings closer to the ground-truth using Graclus. In this figure,
a horizontal line fixed at 1 would have been ideal, indicating perfect correlation. Our results, nevertheless,
show a very strong correlation between the two. These results are not sufficient to be conclusive, as we
need more experiments and other clustering tools. However, this experiment produced promising results
and shows how such a study may be performed.

3.3.4 Runtime scalability 

In our final set of experiments we show the scalability of the proposed method. First, we want to note that
the number of unknowns for the optimization problem is only a function of the aggregation function and
is independent of the graph size. The required number of operations for one function evaluation, on the
other hand, depends linearly on the size of the graph, as illustrated in Figure 6. In this experiment, we used
synthetic graphs with an average degree of 30, and the presented numbers correspond to averages over 10
different runs. As expected, the runtimes scale linearly with the number of edges.

The runtime of the optimization algorithm depends on the number of function evaluations. Since the
algorithm we used is nondeterministic, the number of function evaluations, and hence the runtime, varies even for
different runs on the same problem, and thus is less informative. We are not presenting these results in
detail due to space constraints. However, we want to reemphasize that the size of the optimization problem
does not grow with the graph size, and we don't expect the number of function evaluations to cause any
scalability problems.

Figure 5: The correlation of the arctan-smoothed objective function with variation of information distance using
clusterings generated by Graclus as we vary the steepness parameter. (Plot title: Correlation of Arctangent with
True VI Scores.)

We also observed that the number of function evaluations increases linearly with the number of similarity
metrics. These results are also omitted due to space constraints.

4 Finding latent clusters 

Consider the situation where several edge types share redundant information yet as an ensemble combine to
form some broader structure. For example, scientific journal articles can be connected by text similarity,
abstract similarity, keywords, shared authors, cross-citations, etc. Many of these edge types reflect the topic
of the document, while others are also influenced by the location of the work. Text, abstract, and keyword
similarity are likely to be redundant in conveying topic information (physics, math, biology), while shared
authorship (two articles sharing a common author) is likely to convey both topic and location information,
because we tend to work both with those in our own field and with those in nearby institutions. We say that
the topic and location attributes are latent because they do not exist explicitly in the data. We can represent
much of the variation in the data by two relatively independent clusterings based on the topic of documents
and their location. This compression of information from five edge types to two meaningful clusterings is
the goal of this paper.
Figure 6: Scalability of the proposed method. (Plot title: Scalability of Pull Computation; horizontal axis:
Number of Edges.)

4.1 An illustrative problem 

We construct a graph with multiple edge types to demonstrate latent classes. For illustration, we assume our
graph is perfectly embedded in $\mathbb{R}^2$, as seen in Fig. 7(a). In this example each point on the plane represents
a vertex, and two vertices are connected by an edge if they are close in distance. The similarity/weight for
each edge is inversely proportional to the Euclidean distance. We see visually that there are nine natural
clusters. More interestingly, we see that these clusters are arranged symmetrically along two axes. These
clusters have more structure than the set {1, 2, 3, ..., 9}. Instead they have the structure {1, 2, 3} × {1, 2, 3}.
An example of such a structure would be the separation of academic papers along two factors, {Physics,
Mathematics, Biology} and {West Coast, Midwest, East Coast}. The nine clusters (with examples like
physics articles from the West or biology articles from the Midwest) have underlying structure.

Our data sets do not directly provide this information. For instance, with journal articles we can collect
information about authors, where the articles are published, and their citations. Each of these aspects
provides only a partial view of the underlying structure. Analogous to our geometric example above,
we could consider features of the data as projections of the points to one-dimensional subspaces.
Distances/similarities between the points in a projection have only partial information. This is depicted
pictorially in Fig. 7(b). For instance, the green projection represents a metric that clearly distinguishes between
columns but cannot differentiate between different communities on the same column. The red projection,
on the other hand, provides a diagonal sweep, capturing partial information about columns and partial
information about rows. Neither of the two metrics can provide the full information for the underlying data.
However, when considered as an ensemble they do provide a complete picture. Our goal is to be able to
tease out the latent factors of data from a given set of partial views.

In this paper, we will use this 3x3 example for conceptual purposes and for illustrations. Our approach
is to construct many multi-weighted graphs by using combinations of the partial views of the data. We will
cluster these graphs and analyze these clusters to recover the latent structure. 
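
The construction of the 3x3 example and its partial views can be sketched as follows. This is a minimal illustration rather than the code used for the experiments: the blob spacing, spread, random seed, and the function names `make_grid_points`, `project`, and `similarity` are all assumptions made for the sketch.

```python
import math
import random

def make_grid_points(per_cluster=30, spacing=4.0, spread=0.5, seed=0):
    """Return (points, row_labels, col_labels) for a 3x3 grid of Gaussian blobs.

    spacing, spread and seed are illustrative values, not taken from the paper.
    """
    rng = random.Random(seed)
    points, rows, cols = [], [], []
    for r in range(3):
        for c in range(3):
            for _ in range(per_cluster):
                points.append((c * spacing + rng.gauss(0, spread),
                               r * spacing + rng.gauss(0, spread)))
                rows.append(r)
                cols.append(c)
    return points, rows, cols

def project(points, angle):
    """Project 2-D points onto the unit vector at the given angle (radians)."""
    ux, uy = math.cos(angle), math.sin(angle)
    return [x * ux + y * uy for x, y in points]

def similarity(coords, i, j, eps=1e-6):
    """Edge weight inversely proportional to the projected distance.

    eps avoids division by zero for coincident points (an assumption).
    """
    return 1.0 / (abs(coords[i] - coords[j]) + eps)

points, rows, cols = make_grid_points()      # 9 clusters x 30 points = 270 vertices
horizontal = project(points, 0.0)            # partial view: separates columns only
diagonal = project(points, math.pi / 4)      # partial view: mixes rows and columns
```

Each projection plays the role of one edge type: the horizontal view cannot tell rows apart, while the diagonal view carries partial information about both factors.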

Figure 7: Illustrating clusters: (a) underlying structure and (b) low-dimensional/partial views.
(a) 270 vertices arranged in nine clusters on the plane. Edges exist between vertices so that close points are
well connected and distant points are poorly connected.
(b) Two 1D graphs arranged to suggest their relationship to the underlying 3x3 community structure. Both
have clear community structures that are related to, but not entirely descriptive of, the underlying 3x3 communities.

We expect that different regions of this space will have different clusterings. How drastic these differences
are will depend on the particular multiweighted graph. How can we characterize this space of
clusterings? Are there homogeneous regions, easily identifiable boundaries, groups of similar clusterings, 
etc.? We investigate the existence of a meta-clustering structure. That is we search for whether or not 
several clusterings in this space exhibit community structure themselves. In this section, we present our 
methods for these questions on the 3 x 3 data. We will later provide results on a larger data set. 

4.2 Sampling the clustering space 

To inspect the space of clusterings we sample in a Monte Carlo fashion. We take points $\alpha_i$ in the space
of edge-type weight vectors such that $|\alpha_i| = 1$, and compute the appropriate graph and clustering at each point. We may then compare these
clusterings in aggregate. 

As our first experiment, we take 16 random one-dimensional projections of the points laid out in the 
plane shown in Fig. 7 and consider the projected-point-wise distances in aggregate as a multiweighted
graph. From this multiweighted graph we take 800 samples of the linear space of clusterings. These 800 
clusterings approximate the clustering structure of the multiweighted graph. 
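
The sampling step can be sketched as below. This is a hedged illustration: drawing Gaussian coordinates and normalizing is one standard way to sample uniformly on the unit sphere, and the dictionary layout of `edge_weights` as well as the truncation of negative totals at 0 (borrowed from the setup of Section 5.1.1) are assumptions of the sketch.

```python
import math
import random

def sample_unit_weights(n_types, rng):
    """Draw a weight vector uniformly from the unit sphere in n_types dimensions."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(n_types)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-12:                     # reject the (vanishingly rare) zero vector
            return [x / norm for x in v]

def aggregate(edge_weights, alpha):
    """Combine per-type weights into a single weight per edge.

    edge_weights maps an edge (u, v) to a list with one weight per edge type.
    Negative aggregate weights are truncated at 0 (assumption, see lead-in).
    """
    return {edge: max(0.0, sum(a * w for a, w in zip(alpha, ws)))
            for edge, ws in edge_weights.items()}

rng = random.Random(1)
alphas = [sample_unit_weights(16, rng) for _ in range(800)]   # 800 samples, 16 views
```

Each sampled `alpha` yields one aggregate graph, which is then passed to whatever single-graph clustering algorithm is in use.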

The results of these experiments are presented in Figure 8(a). In this figure each row and column
corresponds to a clustering of the graph. Entries in the matrix represent the variation of information distance
between two clusterings. Therefore dark regions in this matrix are sets of clusterings that are highly similar. 
White bands show informational independence between regions. The rows/columns of this matrix have been 
ordered to have more similar clusterings closer to each other so as to highlight the clusters of clusterings 
detected. 
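
The distance matrix described above can be computed as in the following sketch. It uses the standard definition of the variation of information, VI = H(C) + H(C') - 2 I(C, C'), with natural logarithms; the logarithm base and the function names are assumptions, since the text does not fix them here.

```python
import math
from collections import Counter

def variation_of_information(a, b):
    """VI between two clusterings given as label lists over the same vertices."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    vi = 0.0
    for (x, y), nxy in pab.items():
        pxy = nxy / n
        # H(a|b) + H(b|a), accumulated cell by cell over the joint distribution
        vi -= pxy * (math.log(pxy / (pa[x] / n)) + math.log(pxy / (pb[y] / n)))
    return vi

def vi_matrix(clusterings):
    """Symmetric matrix of pairwise VI distances between sampled clusterings."""
    m = len(clusterings)
    d = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            d[i][j] = d[j][i] = variation_of_information(clusterings[i], clusterings[j])
    return d
```

Two clusterings that differ only by a relabeling of clusters have distance 0, while two independent balanced bisections of four vertices have distance 2 ln 2.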

Figure 8: The meta-clustering information. (a) VI distances between the 800 randomly sampled clusterings. Vertices
are ordered to show the optimal clustering of this graph. Dark blocks on the diagonal represent clusters. The white
band is a group of completely independent clusterings. (b) Three clusterings treated as nodes in a graph. Similar
clusterings (top two) are connected with high-weighted edges. Distant clusterings are connected with low-weighted edges.



4.3 Meta-clusters: clusters of clusterings 

While it is interesting to know that significantly different clusterings can be found, the lack of stable clus- 
tering structure is not helpful for applications of clustering such as for unsupervised learning. We need to 
reduce this set of clusterings further. We approach this problem by applying the idea of clustering onto this 
set of clusterings. We call this problem the meta-clustering problem. 

We represent the clusterings as nodes in a graph and connect them with edge weights determined by the
inverse of the variation of information metric [11]. We inspect this graph to see if it contains clusters. That
is, we cluster the graph of clusterings to see if there exist some tightly coupled clusters of clusterings within
the larger space. For instance, in Fig. 8(b) the top two clusterings differ only in the position of a single
vertex and thus are highly similar. In contrast, the bottom clustering is different from both and is weakly
connected.

Figure 8(a) reveals the meta-clustering structure in our experiments. The dark blocks around the diagonal
correspond to meta-clusters. We can see two big blocks in the upper left and lower right corners.
Furthermore, there is a hierarchical clustering structure within these blocks, as we see smaller blocks within
the larger blocks. In this experiment, we were able to observe meta-clusters. As usual, results depend on the
particular problem instance. While we do not claim that one can always find such meta-clusters, we expect
that they will exist in many multi-weighted graphs, and exploiting the meta-clustering structure can enable
efficient handling of this space, which is the topic of the next section.

4.4 Efficient representation of the clusterings 

In this section we study how to efficiently represent the meta-clustering structure. First we will study how 
to reduce a cluster of clusterings into a single averaged or representative clustering. Then, we will study 
how to select and order a small number of meta-clusters to cover the clustering space efficiently.



4.4.1 Averaging clusterings within a cluster 

Figure 9: The CSPA [18] averaging procedure for clusterings. Each of the three clusterings over the same nodes is
displayed as a block diagonal graph (or permutation) with two nodes connected if and only if they are in the same
cluster. An aggregate graph, the averaged clustering graph (right), is then formed by the addition of these graphs.
This graph is then clustered using a traditional algorithm, and the resulting clustering is returned as the
representative-clustering.

To increase the human accessibility of this information we reduce each cluster of clusterings into a single
representative clustering. We use the "Cluster-based Similarity Partitioning Algorithm" (CSPA) proposed
by Strehl et al. [18] to combine several clusterings into a single average. In this algorithm each pair of
vertices is connected by an edge with weight equal to the number of clusterings in which they share a cluster. If $v_a$
and $v_b$ are in the same cluster in k of the clusterings, then in this new graph they are connected with weight
k. If they are never in the same cluster then they are not connected. We then cluster this graph and use the
resultant clustering as the representative. In Fig. 9 we depict the addition of three clusterings to form an
average graph which can then be clustered.
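
The averaging step can be sketched as follows. The co-association weights follow the description above; the final clustering step is deliberately replaced by a much simpler stand-in (keep pairs that co-occur in more than half of the clusterings and take connected components), which is an assumption of this sketch and not the partitioner used by CSPA or in our experiments.

```python
def co_association(clusterings):
    """weights[i][j] = number of clusterings in which vertices i and j share a cluster."""
    n = len(clusterings[0])
    weights = [[0] * n for _ in range(n)]
    for labels in clusterings:
        for i in range(n):
            for j in range(i + 1, n):
                if labels[i] == labels[j]:
                    weights[i][j] += 1
                    weights[j][i] += 1
    return weights

def representative(clusterings):
    """Stand-in consensus: majority-vote links followed by connected components."""
    w = co_association(clusterings)
    n = len(w)
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    threshold = len(clusterings) / 2
    for i in range(n):
        for j in range(i + 1, n):
            if w[i][j] > threshold:
                parent[find(i)] = find(j)
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]
```

For the three clusterings `[0,0,1,1]`, `[0,0,1,1]`, `[0,1,1,1]` the stand-in returns `[0, 0, 1, 1]`, the clustering supported by the majority.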



We perform this process on the clusters of clusterings found in Section 4.3 and presented in Fig. 8(a)
to obtain the representative-clusterings in Fig. 10. We see that the product of the first two
representative-clusterings identifies the original nine clusters with little error. We see also that the two factors are
identified perfectly by each of these clusterings individually.
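
The product of clusterings used here can be made concrete with a short sketch: two vertices share a product cluster only if they share a cluster in every factor. The function name `product_clustering` is hypothetical.

```python
def product_clustering(*clusterings):
    """Cartesian product of clusterings given as label lists over the same vertices."""
    ids = {}
    # Each vertex is identified by its tuple of labels, one per input clustering
    return [ids.setdefault(signature, len(ids)) for signature in zip(*clusterings)]

# A 3-way "row" factor and a 3-way "column" factor over nine vertices
rows = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cols = [0, 1, 2, 0, 1, 2, 0, 1, 2]
nine = product_clustering(rows, cols)
```

With one vertex per cell of the 3x3 grid, the product of the two three-way factors separates all nine cells.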

4.4.2 Ordering by set-wise information content

In Fig. 10, the original 3x3 community structure can be reconstructed using only the first two
representative-clusterings. Why are these two chosen first? Selecting the third and fourth representative-clusterings would
not have had this pleasant result. How should we order the set of representative-clusterings?

We may judge a set of representative-clusterings by a number of factors: (i) How many of our samples
belong to the associated meta-clusters, i.e., what fraction of the space of clusterings do they cover? (ii) How
much information do the clusterings cover as a set? (iii) How redundant are the clusterings, i.e., how much
informational overlap is present? We would like to maximize information while minimizing redundancy.



In Fig. 10 we ordered the representative-clusterings to maximize set-wise information. Minimizing redundancy
came as a fortunate side-effect. Notice how each of the clusterings in order is independent from the
preceding ones. Knowing that a vertex is red in the first image tells you nothing about the color of the vertex
in the second. The second therefore brings only novel information and no redundancy.

Figure 10: Representative-clusterings of the four dominant clusters-of-clusterings from Fig. 8(a). Clusterings are
displayed as colorings of the original points in the 2-d plane. These are ordered to maximize cumulative set-wise
information. Notice how the first two representative-clusterings recover the original nine clusters exactly.

To compute the information content of a set of clusterings we extend the Variation of Information metric
in a natural way. In Section 2.2 we introduced the mutual information of two clusterings as follows:

$$I(C, C') = \sum_{k=1}^{K} \sum_{l=1}^{K'} P(C, C', k, l) \log \frac{P(C, C', k, l)}{P(C, k)\, P(C', l)}$$

where P() is the probability that a randomly selected node was in the specified clusters. This is equivalent
to the self-information of the Cartesian product of the two clusterings. Its extension to a set of clusterings
$I(C_\alpha, C_\beta, \ldots, C_\omega)$ is

$$I(C_\alpha, C_\beta, \ldots, C_\omega) = -\sum_{a=1}^{K} \sum_{b=1}^{K'} \cdots \sum_{z=1}^{K'''} P(C_\alpha, C_\beta, \ldots, C_\omega, a, b, \ldots, z) \log P(C_\alpha, C_\beta, \ldots, C_\omega, a, b, \ldots, z)$$
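
The set-wise quantity above, read as the self-information of the Cartesian product of the clusterings, can be evaluated directly from label lists. The sketch below is a minimal illustration with a hypothetical function name and natural logarithms (an assumption); the joint probabilities are estimated as the fraction of vertices sharing each combination of labels.

```python
import math
from collections import Counter

def setwise_information(clusterings):
    """Entropy of the joint labelling: -sum P(a, b, ..., z) log P(a, b, ..., z)."""
    n = len(clusterings[0])
    # Count how many vertices carry each combination of cluster labels
    joint = Counter(zip(*clusterings))
    return -sum((c / n) * math.log(c / n) for c in joint.values())

rows = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cols = [0, 1, 2, 0, 1, 2, 0, 1, 2]
```

Two independent three-way factors together carry log 9, whereas repeating the same factor twice adds nothing beyond log 3, which is the redundancy the ordering tries to avoid.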

For a large number of clusterings or large K this quickly becomes inconvenient. In these cases we order 
the clusterings by adding new clusterings to the set based on maximizing the minimum pairwise distance 
to every other clustering currently in the set. This process is seeded with the informationally maximal pair 
within the set. This does not avoid triple-wise information overlap but works well in practice.
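
The greedy ordering can be sketched as follows, given a symmetric matrix of pairwise distances between clusterings. One simplification is made and should be read as an assumption: the seed is taken to be the most distant pair, standing in for the informationally maximal pair.

```python
def maxmin_order(dist):
    """Order items so each new item maximizes its minimum distance to those chosen.

    dist is a symmetric matrix (list of lists) with at least two rows.
    """
    n = len(dist)
    # Seed with the most distant pair (stand-in for the informationally maximal pair)
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: dist[p[0]][p[1]])
    order = [i0, j0]
    remaining = set(range(n)) - {i0, j0}
    while remaining:
        # sorted() makes tie-breaking deterministic (lowest index wins)
        nxt = max(sorted(remaining), key=lambda c: min(dist[c][s] for s in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Only pairwise distances are consulted, so the cost is quadratic in the number of clusterings per step, at the price of ignoring higher-order overlap as noted above.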



4.5 Physics Articles from arXiv.org 

ArXiv.org releases convenient metadata (title, authors, etc.) for all articles in its database. Additionally,
a special set of 30 000 high energy physics articles is released with abstracts and citation networks. We
apply our process to this graph of papers with edge types Titles, Authors, Abstracts, and Citations.

Articles are connected by title or abstract based on the cosine similarity of the text (using the bag of
words model [1]). Two articles are connected by author with weight equal to the number of authors that the two
articles have in common. Two articles are connected by citation if either article cites the other (undirected). We inspect
this system with the process discussed in greater detail above.
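
The three kinds of edge weight described above can be sketched as below. The tokenization (lower-casing and whitespace splitting, no stemming or stop-word removal), the function names, and the `cites` data layout are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts under a bag-of-words model."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def author_weight(authors_a, authors_b):
    """Number of authors the two articles have in common."""
    return len(set(authors_a) & set(authors_b))

def citation_weight(id_a, id_b, cites):
    """1 if either article cites the other, else 0.

    cites maps an article id to the set of ids it cites (hypothetical layout).
    """
    linked = id_b in cites.get(id_a, set()) or id_a in cites.get(id_b, set())
    return 1 if linked else 0
```

Applying `cosine_similarity` to titles and to abstracts gives two of the four edge types; the other two come from `author_weight` and `citation_weight`.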

Figure 11: (a) VI distances between clusterings. The pairwise distances between the sampled clusterings form a graph.
Note the dark blocks along the diagonal. These are indicative of tightly knit clusters. (b) Dendrogram of the graph of
clusterings. We use the ordering of the vertices picked out by the dendrogram to optimally highlight the blocks in
panel (a).

These graphs are normalized by the L2 norm and then the space of composite edge types is sampled
uniformly. That is, $\omega_j = \sum_{i=1}^{4} \alpha_i w_i$, where $\alpha_i \in (-1, 1)$ and $w_i \in \{\text{titles}, \text{abstract}, \text{authors}, \text{citation}\}$.
The resulting graphs are then clustered using Clauset et al.'s FastModularity [3] algorithm. The resulting
clusterings are compared in a graph which is then clustered to produce clusters of clusterings. The clusters
of clusterings are averaged [18] and we inspect the resultant representative-clusterings.



The similarity matrix of the graph of clusterings is shown in Fig. 11(a). The presence of blocks on the
diagonal implies clusters of clusterings. From this process we obtain representative-clusterings. The various
partitionings of the original set of papers vary considerably (large VI distance) yet exhibit high modularity
scores, implying a variety of high-quality clusterings within the dataset.

Analysis of this dataset is challenging and still in progress. We can look at articles in a clustering
and inspect attributes like the country (by the submitting e-mail's country code), or words which occur more
often than statistically expected given the corpus. Most clusterings found show a separation into various
topics identifiable by domain experts (example in Table 2); however, a distinction between clusterings has
not yet been found. While the VI distance between meta-clusters presented in Fig. 11(a) is large, it has
so far proven difficult to identify the qualitative distinction behind the quantitative difference. More in-depth
inspection by a domain expert may be necessary.



5 Finding unexpected clusters 

In many applications, a domain expert can predict what the clusters will look like before computing a 
clustering. For instance, if we have a graph whose vertices are countries with edges representing strong 
ties between the countries, we expect the geographic structure to have a strong influence on the clusters for 
many quantifications of a "strong tie." While this domain specific information can be helpful in many ways,
such a bias may obscure critical information about the data. For instance, we would like to know whether
our graph has any good clusterings that are significantly different from a given list of clusterings, which
brings us to the topic of this section: Given a graph with multiple edge types, and a set of clusterings, how
do we compute an aggregate weighting function, such that clusters of the aggregate graph will be maximally
different from the given set of clusterings?

Table 2: Commonly appearing words (stemmed) in two distinct representative-clusterings. Clusters within each
clustering correspond to well known subfields in High-Energy Physics (Traditional Field Theory/Lattice QCD,
Cosmology/GR, Supersymmetry/String Theory). This data, however, does not show a strong distinction between the
clusterings. Further investigation is warranted.

  Cluster | Statistically Significant Words in Clustering 1
  1       | quantum, algebra, integr, equat, model, chern-simon, lattic, particl, affin
  2       | potenti, casimir, self-dual, dilaton, induc, cosmolog, brane, anomali, scalar
  3       | black, hole, brane, supergrav, cosmolog, ads/cft, sitter, world, entropi
  4       | cosmolog, black, hole, dilaton, graviti, entropi, dirac, 2d, univers
  5       | d-brane, tachyon, string, matrix, theori, noncommut, dualiti, supersymmetr, n=2

  Cluster | Statistically Significant Words in Clustering 2
  1       | potenti, casimir, self-dual, dilaton, induc, energi, scalar, cosmolog, gravit
  2       | integr, model, toda, equat, function, fermion, casimir, affin, dirac
  3       | tachyon, d-brane, string, orbifold, n=2, n=1, dualiti, type, supersymmetr
  4       | black, hole, noncommut, supergrav, brane, sitter, entropi, cosmolog, graviti

This problem can be most naturally modeled as a multi-criteria optimization problem, where the first 
criterion will be the quality of the clustering and the second criterion will be the information independence 
from the given clusterings. We need high quality clusterings, since we want the clustering to show significant 
value. One can impose a clustering structure even on an Erdős-Rényi style graph, but such clusters will not
have any value. We want clusters that are statistically significant. We also want to learn something that we
do not already know, hence the second criterion, information independence. If there are multiple clusterings
we want to deviate from, then the second criterion will itself comprise multiple criteria.

The methods we have proposed so far can be employed to solve this problem as well. Again, we need
to quantify the information independence between two clusterings, for which the variation of information
metric, described in Section 2, can be used. And we need a method to search the space for the aggregate
function efficiently, which can be done using the optimization methods adopted in Section 3.

5.1 Case study: countries

We applied our methods to a graph of the world where each node is a country and modeled the relations
between these countries with 6 types of edges: 1) common land border, 2) major trade partners, 3) common
language, 4) common religion, 5) common ethnicity, 6) similar labor force distribution (industry, agriculture,
services, etc.). This data was taken from the CIA World Factbook 2006, which is conveniently
available in structured text format [2]. The edge types borders and languages are unweighted while the others
are weighted based on the probability that two people from these countries share the same trait. For example,
the probability that a Mexican citizen and an American citizen share a common religion is .24, which becomes
the weight of the edge. The edge types of this graph are varied in nature. Some are small world, some
exhibit hubs and authorities, some have skewed degree-distributions, others do not (see Figure 12). Many of
the edge types are correlated geographically. Ethnic, linguistic, religious, and trade relationships are often
local. Can we find clusterings which are substantially different from this rule?

Figure 12: The world map with some edge types plotted. Top: Export/Import graphs (black and red) and physical
land connections (blue). Bottom: Religion (black) and ethnicity (blue).

5.1.1 Searching with a linear aggregate function 

As in Section 3, we optimized over the space of linear combinations of the edge types, but this time we
allowed the weight of each edge type $\alpha_i \in [-1, 1]$; we did not allow the resulting weight of any edge to be
negative, truncating at 0. We used the HOPSPACK derivative-free optimization package and a coefficient to
balance between the two criteria. Clustering quality is given by the modularity metric, and Clauset et al.'s
software [3] is used to find the clusterings.
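
The quantities entering this search can be sketched as follows. The modularity function is the standard Newman formula for a weighted undirected graph stored as a symmetric matrix. The scalarized `objective` with its `balance` coefficient is only one hypothetical way to combine the two criteria; the exact form used with HOPSPACK is not specified here, and no optimizer call is shown.

```python
def modularity(adj, labels):
    """Newman modularity Q of a weighted undirected graph (symmetric matrix)."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)                      # twice the total edge weight
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m

def aggregate_truncated(layers, alpha):
    """Linear combination of edge-type matrices with negative weights cut to 0."""
    n = len(layers[0])
    return [[max(0.0, sum(a * layer[i][j] for a, layer in zip(alpha, layers)))
             for j in range(n)] for i in range(n)]

def objective(q, vi_distance, balance=0.5):
    """Hypothetical scalarization: clustering quality plus weighted independence."""
    return q + balance * vi_distance
```

A derivative-free optimizer would evaluate `objective` at each candidate `alpha` after clustering the graph returned by `aggregate_truncated`.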

Table 3: Groups of a clustering with maximum information independence from the continent-based clustering. The
modularity of the clustering is 0.397, and the VI distance is 2.6. The solution used linear weights of labor (-0.99),
exports (-0.5), language (0.5), imports (-0.61), religion (-0.06), borders (-0.95), and ethnicity (0.95).

  Cluster 1 | Angola, Belgium, Cambodia, Republic of the Congo, China,
            | Egypt, Greece, Hong Kong, Italy, Laos, Mali, Malaysia, Nigeria,
            | Rwanda, Singapore, Thailand, Taiwan, Papua New Guinea
  Cluster 2 | Denmark, Mongolia, Macedonia, Turkey, Turkmenistan
  Cluster 3 | Belarus, Georgia, Kyrgyzstan, Kazakhstan, Latvia, Lithuania,
            | Serbia, Russia, Ukraine, Uzbekistan
  Cluster 4 | Bangladesh, Benin, Canada, Cameroon, Ireland, France, Ghana,
            | Germany, Haiti, Indonesia, India, Israel, Côte d'Ivoire, Jamaica

In Table 3, we present one of the clusterings we found. We note that the modularity score shows that
the clusters are statistically significant, and the distance from the continents-based clustering is high; thus
this clustering provides new information. This solution strongly favors aspects like language, which is less
geographically linked than others and strongly disfavors aspects like the sharing of a land border. The 
clusters obtained are at best vaguely related to the continental clustering. One can see traces of pre-World
War I empires. Some of the clusters do highlight interesting linguistic groups that are conveniently spread
across continental borders. Cluster 2 contains several countries with Turkic roots. Cluster 3 contains a Slavic 
family. However, it is noteworthy that some countries with Turkic roots, such as Kyrgyzstan, Kazakhstan, 
and Uzbekistan, were grouped with the Slavic family. This shows that this high quality clustering cannot 
be explained by language and/or ethnicity data alone. It is the combination that leads to this particular 
clustering. 



5.1.2 Drawback of linear aggregation 

In addition to some interesting results, our experiments also pointed at some potential drawbacks of using 
a linear aggregate function. We have observed that it is possible to create clusterings with significant information
difference, but these clusterings are not necessarily high in quality and/or statistically significant.
The reason is that linear aggregation may yield dense or sparse but Erdős-Rényi-like random
graphs. In this section, we will explain why this is the case.

The adjacency matrix of an undirected non-negatively weighted graph is symmetric, and thus can be pictured
as an ellipsoid with its axes along the eigenvectors and the length of each axis corresponding to the magnitude
of the eigenvalue. If the matrix has few significant eigenvalues then the ellipsoid is flat in many
directions and the matrix/graph can be simplified to a lower dimensional representation without much loss
(see Figure 13), which is the underlying idea of spectral clustering. A random or complete
graph on the other hand is closer to a symmetric sphere and cannot be reduced without substantial loss.

Figure 13: A clustering of a graph is just a low-rank approximation.




Figure 14: Illustration of the effects of additive (a) and multiplicative (b) aggregate functions on the space of clusterings.
(a) The addition of two adjacency matrices produces a more uniformly symmetric graph. (b) The multiplication of two
adjacency matrices produces a more degenerate, modular graph, amenable to low-rank approximations/clustering.



This raises another interesting question: Can we use multiple edge types to improve the statistical
significance of clusters? For instance, a multiplicative, as opposed to additive, aggregate function may
restrict the search space (as illustrated in Figure 14). In this case, we only take connections endorsed by
more than one edge type into account. We demonstrate this with an example. Clustering the countries graph
with respect to only the religion and only the language edge types produces modularity scores of 0.44 and 0.3,
respectively. If we use a graph where we have an edge if either a language or a religion edge exists, the
modularity score decreases to 0.23, which means the clusters are statistically less significant. On the other
hand, if we use a graph where we have an edge only if we have both a language and a religion edge, the
modularity increases to 0.47, which means that the clusters are much more significant.
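
The two ways of combining a pair of edge types can be written down directly. The sketch below uses small hypothetical adjacency matrices and function names chosen for illustration: the additive aggregate keeps an edge supported by either type, the multiplicative aggregate keeps only edges supported by both.

```python
def sum_graph(a, b):
    """Additive aggregate: an edge survives if either edge type has it."""
    n = len(a)
    return [[a[i][j] + b[i][j] for j in range(n)] for i in range(n)]

def product_graph(a, b):
    """Multiplicative aggregate: an edge survives only if both edge types have it."""
    n = len(a)
    return [[a[i][j] * b[i][j] for j in range(n)] for i in range(n)]

def edge_count(adj):
    """Number of undirected edges with non-zero weight."""
    n = len(adj)
    return sum(1 for i in range(n) for j in range(i + 1, n) if adj[i][j] != 0)

# Two hypothetical edge types over three vertices
language = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
religion = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
either = sum_graph(language, religion)       # denser graph
both = product_graph(language, religion)     # sparser graph, only endorsed edges
```

In this toy case the additive graph has three edges and the multiplicative graph has one, which illustrates how the product restricts attention to connections endorsed by more than one edge type.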



5.2 Finding clusters with higher significance by using pair-products 

In the previous section, we discussed how significance of clusterings can be improved by looking at products
of edge types (i.e., we keep an edge only if both edge types support it). In this section, we present results based on
this idea. We computed the modularity scores and VI distances from the continental clustering for all single
edge types and the product of each pair of edge types. We display the ones with interesting results in Table 4.



Using the same approach as in Section 4.4.2, we select pairs that maximize set-wise information away
from the continental clustering. We found that the clustering that is most different from the continental 
clustering is given by the product of the language and ethnicity graphs. Given those two clusterings the next 
most informationally distant comes from looking at import-export relationships. Finally, after the countries 
have been classified along these edge types the religion graph is the most distinguishing. Note that the 
products yield very high quality clusterings which are at the same time distant from each other. 

The three clusterings resulting from these three graphs are sufficiently distinct to distinguish almost 
all of the different countries from each other. The clusterings are sufficiently distant so that almost no 
two countries lie in the same cluster in all three of the clusterings. Equivalently most countries can be 
identified by their membership in the three clusterings. This achieves the goal set forth in Section 4 to
efficiently represent a graph through well chosen representative clusterings, drastically reducing information 
with minimal content loss. 

Table 4: A few of the product-pairs and singleton edge types listed along with their modularity score and variation of
information distance from the continental clustering.

  Single Metric                       | Metric Pair
  Name     | Modularity | VI distance | Name                 | Modularity | VI distance
  language | 0.25       | 1.79        | language x ethnicity | 0.43       | 2.40
  religion | 0.37       | 2.23        | religion x ethnicity | 0.45       | 2.16
  border   | 0.42       | 0.69        | imports x religion   | 0.41       | 2.17
  exports  | 0.29       | 1.94        | imports x labor      | 0.29       | 1.85
           |            |             | exports x imports    | 0.37       | 2.35
           |            |             | ethnicity x exports  | 0.41       | 1.93
           |            |             | labor x language     | 0.26       | 2.04
           |            |             | labor x exports      | 0.28       | 1.95

  Continents | Language x Ethnicity | Imports x Exports | Religion | ...

Table 5: A set of edge types ordered to yield maximum set-wise information / minimum set-wise redundancy.
Each in turn is as different as possible from each of the preceding clusterings. Note that these
represent relatively orthogonal ideas (geography, culture, economy, ...) in turn.

Another interesting result in our experiments was the groups of countries that were always grouped
together, which are listed in Table 6. This list will not be surprising to many, but it still underlines the
concept of metric invariant clusters in graphs with multiple edge types.

6 Conclusions 

We addressed clustering in the context of graphs with multiple edge types. We investigated several prob- 
lems within this context: recovering an aggregation scheme from ground-truth clustering, finding a meta- 
clustering structure and efficiently representing this structure, and finding unexpected clusters. We also 
presented case studies on real data sets. 

Table 6: Some groups of countries that were consistently clustered together

  Group 1 | Lebanon, Syria, Yemen
  Group 2 | Norway, Sweden
  Group 3 | Belgium, Denmark, Netherlands
  Group 4 | Colombia, Dominican Republic, Ecuador, El Salvador, Guatemala,
          | Honduras, Mexico, Nicaragua, Peru, Venezuela

The main result of our work is that working on graphs with multiple edge types prevents loss of crucial
information. Instead of a single clustering, a rich clustering structure can exist with clusters of clusterings,
and latent clusters can be discovered that compactly explain the underlying graph. We showed that high 
quality clusters which are significantly different than expected can be found, and significance of clusters 
can be improved by looking at intersections of multiple edge types. Another important conclusion of this 
work is that, despite the increased complexity due to multiple edge types, we have the algorithmic tools to 
make the associated problems tractable. 

We hope that our work will draw the attention of the research community to this important problem, 
which bears many intriguing research challenges, and we can see much room for growth in this topic. We
have mostly focused on linear aggregation; nonlinear combinations (as we have seen in Section 5.2)
should be investigated. More intelligent sampling methods should be possible. And most importantly,
working on more data sets will give us a chance to better evaluate our methods. 